From genes to protein structure and function:
novel applications of computational
approaches in the genomic era

Jeffrey Skolnick and Jacquelyn S. Fetrow

The genome-sequencing projects are providing a detailed 'parts list' of life. A key to comprehending this list is understanding
the function of each gene and each protein at various levels. Sequence-based methods for function prediction are inadequate 
because of the multifunctional nature of proteins. However, just knowing the structure of the protein is also insufficient for 
prediction of multiple functional sites. Structural descriptors for protein functional sites are crucial for unlocking the secrets 
in both the sequence and structural-genomics projects. 



Genome-sequencing projects are providing a
detailed 'parts list' for life. Unfortunately, this list,
a portion of which represents the amino acid
sequence of all the proteins in a given genome, does
not come with an instruction manual. That is, given
the genome's sequences, one does not necessarily know
straight away which regions encode proteins, which
serve a regulatory role and which are responsible for
the structure and replication of the DNA itself.

This is not unlike giving a child a list of parts
necessary to create a working automobile. Without the
necessary expertise, creating the final, working car from
just the initial parts list is a nearly impossible task.
Similarly, understanding how to create a complete,
functioning cell given just the sequence of nucleotides
found in an organism's genome is a complex problem.

What is a protein function? 

After a genome is sequenced and its complete parts
list determined, the next goal is to understand the
function(s) of each part, including that of the proteins. What
do we mean by protein function, the focus of this article?

Function has many meanings. At one level, the
protein could be a globular protein, such as an enzyme,
hormone or antibody, or it could be a structural or
membrane-bound protein. Another level is its
biochemical function, such as the chemical reaction and
the substrate specificity of an enzyme. The regulatory
molecules or cofactors that bind to a protein are also
levels of biochemical function.

At the cellular level, the protein's function would
involve its interaction with other macromolecules and
the function and cellular location of such complexes.
There is also the protein's physiological function; that
is, in which metabolic pathway the protein is involved
or what physiological role it performs in the organism.
Finally, the phenotypic function is the role played by
the protein in the total organism, which is observed by
deleting or mutating the gene encoding the protein.

Obviously, the complete characterization of protein
function is difficult but efforts are under way at all levels,
including cellular function. In this article, however,
we focus on identifying the biochemical function of a
protein given its sequence, a problem that is amenable to
molecular approaches.

Sequence-based approaches to function 
prediction 

The sequence-to-function approach is the most
commonly used function-prediction method. This robust
field is well developed and, in the interest of space
limitations, we will merely present a brief overview.

There are two main flavors of this approach:
sequence-alignment and sequence-motif methods such as
Prosite, Blocks, Prints and Emotif. Both the
alignment and the motif methods are powerful but a
recent analysis has demonstrated their significant
limitations, suggesting that these methods will increasingly
fail as the protein-sequence databases become more
diverse.

An extension of these approaches that combines
protein-sequence with structural information has been
developed and some successes have been reported.
However, this method still applies the structural
information in a one-dimensional, 'sequence-like' fashion
and fails to take into account the powerful
three-dimensional information displayed by protein structures.

In addition, proteins can gain and lose function
during evolution and may, indeed, have multiple functions
in the cell (Box 1). Sequence-to-function methods
cannot specifically identify these complexities.
Inaccurate use of sequence-to-function methods has led to
significant function-annotation errors in the sequence
databases.

An alternative approach

An alternative, complementary approach to
protein-function prediction uses the sequence-to-structure-to-function
paradigm. Here, the goal is to determine the
structure of the protein of interest and then to identify
the functionally important residues in that structure.
Using the chemical structure itself to identify functional
sites is more in line with how the protein actually works.

In a sense, this is one long-term goal of 'structural
genomics' projects, which are designed to determine
possible protein folds experimentally, just
as genome-sequencing projects are determining all
protein sequences. This is in contrast to traditional
structural-biology approaches, in which one knows the
protein's function first and only then, if the function is
sufficiently important, determines its structure.

It is implicitly assumed that having the protein's
structure will provide insights into its function, thereby
furthering the goals of the human-genome-sequencing
project. However, knowing a protein's three-dimensional
structure is insufficient to determine its function
(Box 2). What we really need to analyse and predict the
multifunctional aspects of proteins is a method
specifically to recognize active sites and binding regions in
these protein structures.

Active-site identification

In order to use a structure-based approach to function
prediction, one must identify the key residues
responsible for a given biochemical activity. For many years,
it has been suggested that the active sites in proteins are
better conserved than the overall fold. Taken to the
limit, this suggests that one could not only identify
distant ancestors with the same global fold and the same
activity but also proteins with similar functions but
distantly related, or possibly unrelated, global folds.

The validity of this suggestion was demonstrated
empirically by Nussinov and co-workers, who showed
that the active sites of eukaryotic serine proteases,
subtilisins and sulfhydryl proteases exhibit similar structural
motifs. Furthermore, in a recent modeling study of
Saccharomyces cerevisiae proteins, protein functional sites
were found to be more conserved than other parts of
the protein models. Similarly, it has been
demonstrated that the catalytic triad of the α/β hydrolases
is structurally better conserved than other
histidine-containing triads. A comparison of the structure of the
hydrolase catalytic triad to other histidine-containing
triads shows a distinct bimodal distribution, while a
similar analysis done with a randomly selected triad shows
a unimodal distribution (Fig. 1).
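
As a rough illustration of how two three-residue sites can be compared
numerically, the sketch below computes a distance RMSD between two triads.
This is a simplified stand-in, not the procedure behind Fig. 1: the article
describes structural alignment of triads, whereas this sketch compares
inter-point distances and therefore needs no superposition. The function
names and all coordinates are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Distances between every pair of 3D points, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_rmsd(site_a, site_b):
    """Distance RMSD between two equally sized, equally ordered point sets."""
    da = pairwise_distances(site_a)
    db = pairwise_distances(site_b)
    if len(da) != len(db):
        raise ValueError("sites must contain the same number of points")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


# Hypothetical positions in angstroms, one point per residue of a triad.
reference_triad = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
similar_triad = [(1.0, 1.0, 1.0), (7.0, 1.0, 1.0), (1.0, 9.5, 1.0)]
unrelated_triad = [(0.0, 0.0, 0.0), (12.0, 0.0, 0.0), (0.0, 3.0, 0.0)]

print(distance_rmsd(reference_triad, similar_triad))    # small value
print(distance_rmsd(reference_triad, unrelated_triad))  # much larger value
```

Scoring one reference triad against every histidine-containing triad in a
structure database and plotting a histogram of the scores would give the kind
of distribution that the text describes as bimodal or unimodal.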

Kasuya and Thornton generalized this example by
creating structural analogs of a few Prosite sequence
motifs. For the 20 most-frequently occurring Prosite
patterns, the associated local structure is quite distinct.
These results provide clear evidence that enzyme active
sites are indeed more highly conserved than other parts
of the protein.

Identifying active sites in experimental structures
Historically, several groups have attempted to
identify functional sites in proteins; these efforts were
directed at protein engineering or building functional
sites in places where they did not previously exist. This
has been successfully accomplished for several
metal-binding sites. However, highly accurate functional-site
descriptors of the backbone and side-chain atoms were
required, fueling the belief that significant atomic detail
is required in site descriptors for function identification.
Highly detailed residue side-chain descriptors of the
active sites of serine proteases and related proteins have
been used to identify functional sites. The use of these
highly detailed motifs has led to the identification of
several novel functional sites in known, high-quality
protein structures. More automated methods for
finding spatial motifs in protein structures have also
been described.



Box 1. Proteins are multifunctional



A common protein characteristic that makes functional analysis based
only on homology especially difficult is the tendency of proteins to be
multifunctional. For instance, lactate dehydrogenase binds NAD,
substrate and zinc, and performs a redox reaction. Each of these occurs
at different functional sites that are in close proximity and the
combination of all four sites creates the fully functional protein.

Other examples of multifunctional proteins are the nucleic-acid-binding
proteins. For instance, DNA regulatory proteins often contain a
DNA-binding domain, a multimerization domain and additional sites that bind
regulatory proteins; a classic example is RecA. The 3C rhinovirus
protease exhibits a proteolytic function as well as an RNA-binding
function. Transcription factors are also complex, multifunctional
proteins. It is becoming increasingly important to recognize each of
these different functions of gene products of a newly sequenced gene.

The serine-threonine-phosphatase superfamily is a prime example of
the difficulties of using standard sequence analysis to recognize the
multiple functions found in single proteins. This large protein family is
divided into a number of subfamilies, all of which contain an essential
phosphatase active site. Subfamilies 1, 2A and 2B exhibit 40% or more
sequence identity between them. However, each of these subfamilies
is apparently regulated differently in the cell and observation
suggests that there are different functional sites at which regulation can
occur. Because the sequence identity between subfamilies is so high,
standard sequence-similarity methods could easily misclassify new
sequences as members of the wrong subfamily if the functional sites
are not carefully considered, as was recently demonstrated.

These are but a few examples of the multifunctionality of proteins.
The recognition of this multifunctional nature is of critical importance
to the genomics field. Useful functional-annotation methods must
consider all of the specific functions in a given protein and will not just
provide a general classification of function.

Unfortunately, most of these methods require the
exact placement of atoms within protein backbones and
side chains, and so have not been shown to be relevant
to inexact predicted structures. Recently, however, we
described the production of fuzzy, inexact descriptors
of protein functional sites. As we wish to apply the
descriptors to experimental structures as well as to
predicted protein models, we used only α-carbon atoms and
side-chain center-of-mass positions. We call these
descriptors 'fuzzy functional forms' (FFFs) and have
created them for both the disulfide-oxidoreductase
and α/β-hydrolase catalytic active sites.
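
To make the idea of an inexact, geometry-based site descriptor concrete, the
following minimal sketch scans a structure for residues of the required types
whose pairwise distances fall inside tolerance windows. It is only an
illustration under stated assumptions: the descriptor layout, the distance
windows, the toy coordinates and the function name are all hypothetical and
are not the published FFF parameters. Each residue is represented by a single
point, assumed here to be its α-carbon position.

```python
import math
from itertools import permutations

# Hypothetical descriptor: required residue types plus allowed distance
# windows (angstroms) between them. Wide windows are what make it "fuzzy".
DESCRIPTOR = {
    "residues": ("CYS", "CYS", "PRO"),
    "distance_ranges": {(0, 1): (4.5, 6.5), (0, 2): (7.0, 11.0), (1, 2): (5.0, 9.0)},
}


def find_site_matches(structure, descriptor):
    """Return index tuples of residues that satisfy the descriptor.

    structure is a list of (residue_name, (x, y, z)) entries.
    Brute-force search: fine for a sketch, too slow for real screening.
    """
    wanted = descriptor["residues"]
    matches = []
    for combo in permutations(range(len(structure)), len(wanted)):
        if any(structure[i][0] != name for i, name in zip(combo, wanted)):
            continue
        if all(
            lo <= math.dist(structure[combo[a]][1], structure[combo[b]][1]) <= hi
            for (a, b), (lo, hi) in descriptor["distance_ranges"].items()
        ):
            matches.append(combo)
    return matches


# Toy structure with made-up coordinates.
toy_structure = [
    ("GLY", (0.0, 0.0, 0.0)),
    ("CYS", (3.8, 0.0, 0.0)),
    ("GLY", (6.0, 3.0, 0.0)),
    ("CYS", (9.0, 0.5, 0.0)),
    ("PRO", (12.0, 6.0, 0.0)),
    ("CYS", (30.0, 30.0, 30.0)),  # right residue type, wrong geometry
]

print(find_site_matches(toy_structure, DESCRIPTOR))
```

Because the test is on distance windows rather than exact atomic positions,
a model whose coordinates are off by an angstrom or so can still match, which
is the property that lets such descriptors be applied to predicted structures.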

The disulfide-oxidoreductase FFF was applied to
screen high-resolution structures from the Brookhaven
protein database. In a dataset of 364 protein structures,
the FFF accurately identified all proteins known to
exhibit the disulfide-oxidoreductase active site. In a
larger dataset of 1501 proteins, the FFF again accurately
identified all proteins with the active site. In addition,
it identified another protein, 1fjm, a serine-threonine
phosphatase. This result was initially discouraging but
subsequent sequence alignment and clustering analysis
strongly suggested that this putative site might indeed
be a site of redox regulation in the serine-threonine
phosphatase-1 subfamily. If confirmed by experiment,
this result will highlight the advantages of using
structural descriptors to analyse multiple functional sites in
proteins. It will also highlight the fact that human
observation alone is no longer adequate for identifying
all functional sites in known protein structures.

To date, the use of structure to identify function has
largely focused on high-resolution structures and highly
detailed descriptors of protein functional sites.
However, the creation of inexact descriptors for functional
sites opens the way to the application of these methods
to inexact, predicted protein models. The question
remains: how good does a model have to be in order
to use FFFs to identify its active sites?



Box 2. Knowing a protein's structure does not necessarily
tell you its function



Because proteins can have similar folds but different functions,
determining the structure of a protein may or may not tell you
something about its function. The most well-studied example is the (α/β)8
barrel enzymes, of which triose-phosphate isomerase (TIM) is the
archetypal representative. Members of this family have similar overall
structures but different functions, including different active sites, substrate
specificities and cofactor requirements.

Is this example common? Our own analysis of the 1997 SCOP
database shows that the five largest fold families are the
ferredoxin-like, the (α/β) barrels, the knottins, the immunoglobulin-like and the
flavodoxin-like fold families with 22, 18, 13, 9 and 9 subfamilies,
respectively (Fig. i). In fact, 57 of the SCOP fold families consist of multiple
superfamilies. These data only show the tip of the iceberg, because
each superfamily is further composed of protein families and each
individual family can have radically different functions. For example, the
ferredoxin-like superfamily contains families identified as Fe-S ferredoxins,
ribosomal proteins, DNA-binding proteins and phosphatases, among
others.

After this article was submitted, a much-more-detailed analysis of the
SCOP database was published (Ref. 72). This finds a broad function-structure
correlation for some structural classes, but also finds a number of
ubiquitous functions and structures that occur across a number of
families. The article provides a useful analysis of the confidence with which
structure and function can be correlated. Knowing the protein
structure by itself is insufficient to annotate a number of functional classes
and is also insufficient for annotating the specific details of protein
function.

Figure i

Histogram of the numbers of superfamilies found in each SCOP fold family.
These data clearly show that proteins with similar structures can have different
functions and demonstrate the difficulty of assigning protein function based
simply on the three-dimensional structure. The data were taken from the 1997
distribution of SCOP (http://scop.mrc-lmb.cam.ac.uk/scop). For a more-detailed
analysis, see Ref. 72.



The state of the art in structure-prediction 
methods 

For proteins whose sequence identity to a protein of
known structure is above ~30%, one can use homology
modeling to build the structure. However, structure
prediction is far more difficult for proteins that are not
homologous to proteins with known structure. At present,
there are two approaches for these sequences: ab initio
folding and threading.

In ab initio folding, one starts from a random
conformation and then attempts to assemble the native
structure. As this method does not rely on a library of
pre-existing folds, it can be used to predict novel
folds. The recent CASP3 protein-structure-prediction
experiment (http://PredictionCenter.llnl.gov/CASP3)
involved the blind prediction of the structure of
proteins whose actual structure was about to be
experimentally determined. These results indicate that
considerable progress has been made. For helical and
α/β proteins with less than 110 residues, structures
were often predicted whose backbone root-mean-square
deviation (RMSD) from native ranged from
4-7 Å. Progress is being made with the β proteins, too,
although they remain problematic. Because ab initio
methods can identify novel folds, these methods could
be used to help to select sequences likely to yield novel
folds in experimental structural-genomics projects.

Another approach to tertiary-structure prediction is
threading. Here, for the sequence of interest, one
attempts to find the closest matching structure in a
library of known folds. Threading is applicable to
proteins of up to 500 residues or so and is much faster
than ab initio approaches. However, threading cannot
be used to obtain novel folds.

Ab initio predicted models can be used for automatic
protein-function prediction

The results of the recent CASP3 competition
suggest that current modeling methods can often (but not
always) create inexact protein models. Are these
structures useful for identifying functional sites in proteins?
Using the ab initio structure-prediction program
MONSSTER, the tertiary structure of a glutaredoxin,
1ego, was predicted. For the lowest-energy model,
the overall backbone RMSD from the crystal structure
was 5.7 Å.

To determine whether this inexact model could be
used for function identification, the sets of correctly
and incorrectly folded structures were screened with
the FFF for disulfide-oxidoreductase activity. The
FFF uniquely identified the active site in the correctly
folded structure but not in the incorrectly folded ones
(Fig. 2). This is a proof-of-principle demonstration that
inexact models produced by ab initio prediction of
structure from sequence can be used for the subsequent
prediction of biochemical function. Of course,
improvements in the method have to be made before such
predictions can be done on a routine basis.

Use of predicted structures from threading in
protein-function prediction

At present, practical limitations preclude folding an
entire genome of proteins using ab initio methods.
Threading is more appropriate for achieving the requisite
high-throughput structure prediction. Thus, a
standard threading algorithm has been used to screen all
proteins in nine genomes for the disulfide-oxidoreductase
active site described above.

First, sequences that aligned with the structures of
known disulfide oxidoreductases were identified. Then,
the structure was searched for matches to the
active-site residues and geometry. For those sequences for
which other homologs were available, a
sequence-conservation profile was constructed. If the putative
active-site residues were not conserved in the sequence
subfamily to which the protein belongs, that sequence
was eliminated. Otherwise, the sequence is predicted
to have the function.
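
The three screening steps just described can be sketched as a simple chain of
filters. This is a schematic outline only: the `Candidate` fields, the
`predict_function` name and the 0.8 conservation cut-off are hypothetical
placeholders chosen for illustration, and each boolean stands in for the
outcome of a real threading, geometry-matching or profile calculation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One genome sequence with the outcome of each screening step."""
    name: str
    threads_onto_known_fold: bool        # step 1: aligned to a known structure
    site_geometry_matches: bool          # step 2: site residues and geometry found
    site_conservation: Optional[float]   # step 3: fraction of homologs keeping the
                                         # putative site residues; None if no homologs


def predict_function(candidate: Candidate, min_conservation: float = 0.8) -> bool:
    """Return True if the candidate survives all three filters.

    min_conservation is an illustrative cut-off, not a value from the article.
    """
    if not candidate.threads_onto_known_fold:
        return False
    if not candidate.site_geometry_matches:
        return False
    # The conservation filter only applies when homologs are available.
    if candidate.site_conservation is not None:
        return candidate.site_conservation >= min_conservation
    return True


candidates = [
    Candidate("seq_A", True, True, 0.95),   # passes every step
    Candidate("seq_B", True, True, 0.30),   # site not conserved in its subfamily
    Candidate("seq_C", True, False, None),  # fold matches, but no site geometry
    Candidate("seq_D", True, True, None),   # no homologs, so step 3 is skipped
]
predicted = [c.name for c in candidates if predict_function(c)]
print(predicted)
```

Ordering the filters from cheapest evidence to most specific means that a
sequence is only kept when fold, site geometry and (where available)
conservation all agree, which is how the structural requirement helps to
remove hits that a sequence match alone would accept.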

Using this sequence-to-structure-to-function method,
99% of the proteins in the nine genomes that have
known disulfide-oxidoreductase activity have been
found. From 10% to 30% more functional predictions
are made than by alternative sequence-based approaches;
similar results are seen for the α/β hydrolases.
Surprisingly, in spite of the fact that threading algorithms
have problems generating good sequence-to-structure
alignments, active sites are often accurately aligned,
even for very distant matches. This observation would
agree with the above experimental results indicating
that active sites are well conserved in protein structures.

Importantly, the false-positive rate when using
structural information is much lower than that found using
sequence-based approaches, as demonstrated by a
detailed comparison of the FFF structural approach and
the Blocks sequence-motif approach (N. Siew et al.,
unpublished). In this study, the sequences in eight
genomes, including Bacillus subtilis, were analysed for
disulfide-oxidoreductase function using the
disulfide-oxidoreductase FFF, the thioredoxin Block 00194 and
the glutaredoxin Block 00195. If we assume that those
sequences identified by both the FFF and Blocks
are 'true positives', we find 13 such sequences in the
B. subtilis genome.

There is no experimental evidence validating all of
these 'true positives' and so they are more accurately
termed 'consensus positives'. In order to find these 13
'consensus positive' sequences, the FFF hits seven false
positives. On the other hand, Blocks hits 23 false
positives (Fig. 3). It was previously suggested that the
use of a functional requirement adds information to
threading and reduces the number of false positives.
These data, including the data shown in Fig. 3, validate
this claim on a genome-wide basis.
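
The counts quoted above can be turned into a single comparable number: the
fraction of each method's hits that are consensus positives. The short sketch
below does this arithmetic using only the figures given in the text (13
consensus positives, seven false positives for the FFF and 23 for Blocks);
the function name is an illustrative choice, and the "false positives" are
putative, as no experimental validation is available.

```python
def consensus_fraction(consensus_positives: int, false_positives: int) -> float:
    """Fraction of all hits that are consensus positives."""
    return consensus_positives / (consensus_positives + false_positives)


fff_fraction = consensus_fraction(13, 7)      # 13 of 20 hits
blocks_fraction = consensus_fraction(13, 23)  # 13 of 36 hits

print(f"FFF:    {fff_fraction:.2f}")
print(f"Blocks: {blocks_fraction:.2f}")
```

On these numbers, 13/20 = 0.65 of the FFF hits are consensus positives,
compared with 13/36, or roughly 0.36, of the Blocks hits.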

Of course, as no genome has had the function of all
of its proteins experimentally annotated, it is
impossible to know how many other proteins with the
specified biochemical function were not properly identified.
This is a critical question for researchers attempting to
predict protein function. Experimental confirmation
will be needed to validate this or any other method
fully. This points out the need for closely coupling
computational function-prediction algorithms with
experiments.

Weaknesses of using the sequence-to-structure- 
to-function method of function prediction 

Based on studies to date, the identification of
enzymatic activity requires a model in which the backbone
RMSD from native near the active sites is about 4-5 Å.
Predicted models are better at describing the geometry
in the core of the molecule than in the loops and so
predicting the function of a protein whose active site is
in loops may be a problem. Also, the method can
currently only be applied to enzyme active sites; substrate-
and ligand-binding sites have not been identified using
the inexact models. Techniques that will further refine
inexact protein models will be quite useful in taking
the protein analysis to the next step.

Figure 1

The distribution of root-mean-square deviations (RMSD) between the hydrolase
catalytic triad and all other histidine-containing triads shows a bimodal distribution
(a); by contrast, the RMSD between a randomly selected (non-catalytic) triad and all
other histidine-containing triads has a unimodal distribution (b). The His-Ser-Asp
catalytic triad in the protein 1gpl (RP2 lipase) (a) and a random histidine-containing
triad from 4pga (glutaminase-asparaginase) (b) were structurally aligned to all
His-containing triads in a database of 1037 proteins. Actual α/β-hydrolase active sites
(a) and the 4pga site (b) are indicated by blue bars; other histidine triads that are
not active sites are indicated by red bars. None of the sites found by matching to the
4pga were hydrolase active sites. Inset graphs show the full distribution.

Conclusions 

Although sequence-based approaches to
protein-function prediction have proved to be very useful,
alternatives are needed to assign the biochemical function
of the 30-50% of proteins whose function cannot be
assigned by any current methods. One emerging
approach involves the sequence-to-structure-to-function
paradigm. Such structures might be provided by
structural-genomics projects or by structure-prediction
algorithms. Functional assignment is made by
screening the resulting structure against a library of structural
descriptors for known active sites or binding regions.
Detailed descriptors will only work on the
experimentally determined, high-quality structures. Ideally,
however, the descriptors should work on both
experimental structures and the cruder models provided by
tertiary-structure-prediction algorithms.

Figure 2

Application of the disulfide-oxidoreductase fuzzy functional form (FFF) to ab initio
models of glutaredoxin created by the program MONSSTER shows that the FFF can
distinguish between correctly folded and misfolded (or higher-energy) models. The FFF
is shown as two orange balls (representing the cysteines) and a blue ball
(representing the proline). The protein models are shown as magenta wire models with the
active-site cysteines and proline shown as yellow and cyan balls, respectively. The FFF clearly
distinguishes the correct active site in the crystal structure of the glutaredoxin 1ego
and the correctly folded, lowest-energy model. The FFF does not match to the active
sites of any of the higher-energy, misfolded structures, four of which are shown here.

Figure 3

Analysis of the Bacillus subtilis genome using the thioredoxin Block 00194. The Blocks
score (computed using the publicly available BLIMPS program) is plotted on the x axis
and the number of sequences found in each scoring bin is plotted on the y axis. Those
sequences identified as 'consensus positives' [identified by both the fuzzy functional
form (FFF) and the Block] are shown as red bars. One additional sequence found by
the FFF, which is likely to be a true positive, is shown as a blue bar. All other
sequences, putative 'false positives', are shown as yellow bars. Using the Blocks
score at which all 13 of the 'consensus positives' are found, 23 false positives are
also found. In its analysis of the B. subtilis genome, the FFF identifies only seven false
positives along with the same 13 'consensus positives' (data not shown).

The advantages of such an approach are that one need
not establish an evolutionary relationship in order to
assign function, that more than one function can be
assigned to a given protein [an issue of major
importance, because proteins are multifunctional (Box 1)]
and, ultimately, that having a structure can provide
deeper insight into the biological mechanism of
protein function and regulation. The disadvantages are that
one needs to have the protein's structure before a
function can be assigned and that the approach is limited to
those functions associated with proteins with at least
one solved structure, so that a functional-site descriptor
can be constructed.

In this sense, structure-to-function assignment can be
thought of as 'functional threading': find the
active-site match in a library of descriptors for known protein
active sites. This is the first step in the long process of
using structure to assign all levels of function, a goal
that is made increasingly important with the emergence
of structural genomics. Based on the progress to date,
it is apparent that structure will play an important role
in the post-genomic era of biology.